Robust machine translation for multi-domain tasks
Abstract
In this thesis, we investigate and extend the phrase-based approach to statistical machine translation. Due to improved concepts and algorithms, the quality of the generated translation hypotheses has improved significantly in recent years. Still, the translation quality leaves a lot to be desired when going beyond traditional translation tasks, such as newswire articles, and when addressing more ambitious translation problems. We extend the state of the art in phrase-based translation, which enables us to build a robust translation system for multi-domain input. Robustness is regarded here as the ability to produce high-quality translations for arbitrary input texts, e.g., automatic transcriptions of recognized speech or other unstructured, potentially noisy input. In this work, we focus on Arabic-English translation tasks. We study the search problem for phrase-based statistical machine translation in detail. For this, we examine the effect of the different models on the translation quality. Moreover, we make an explicit distinction between reordering (coverage) and lexical hypotheses in the pruning process and stress the importance of coverage pruning for adjusting the balance between hypotheses representing different reorderings (coverage hypotheses) and hypotheses with different lexical representations. We present constraints to solve the reordering problem in machine translation. To tune our translation system for multi-domain input and to improve the robustness built into the decoder, we apply domain adaptation to the language models and rerank the candidate translations using appropriate rescoring models. We also present our work on adjusting the vocabularies of the speech recognizer and the machine translation system in a preprocessing step and on predicting missing punctuation marks for automatically transcribed speech during the actual translation process.
Processing morphologically rich languages such as Arabic generally places high demands on preprocessing. We show that the choice of the appropriate preprocessing strategy depends on the translation domain and on the structure of the input data. Experimental results emphasize how the proper choice of the preprocessing approach helps to increase the translation quality. In addition, we address the task of improving the translation quality by means of syntactically motivated feature functions within a reranking concept. Then, we investigate different data-driven approaches to the task of transliterating proper names. Often, such names are out-of-vocabulary terms and the intention is to preserve the names by transliteration. Finally, we show how human translators can be assisted by machine translation systems. We compare search strategies for interactive machine translation. The presented machine translation system achieves state-of-the-art performance and has been successfully applied to the large-scale Arabic-English GALE translation evaluations. Furthermore, the system was ranked among the top submissions for the NIST Open Machine Translation Evaluation 2006 and for the series of IWSLT evaluation campaigns.
Similar resources
Robust Speech Translation by Domain Adaptation
Speech translation tasks usually differ from text-based machine translation tasks, and the training data for speech translation tasks are usually very limited. Therefore, domain adaptation is crucial to achieve robust performance across different conditions in speech translation. In this paper, we study the problem of adapting a general-domain, writing-text-style machine translation syste...
A Robust FACTS Damping Controller Design to Mitigate Interarea Oscillations in a Multi-machine Power System
In this paper, damping of interarea oscillations using simultaneous coordination of static Var compensator (SVC) and power system stabilizer (PSS) is considered. To be effective in damping of oscillations, the best-input signal of power oscillation damper (POD) associated with SVC is selected using Hankel singular values (HSVs), and right-hand plane zeros (RHP-zeros). The 4-machine-2 area...
A new model for Persian multi-part words edition based on statistical machine translation
Multi-part words in the English language are hyphenated; the hyphen separates the different parts. Persian has multi-part words as well. Based on Persian morphology, a half-space character is needed to separate the parts of multi-part words, but in many cases people incorrectly use a space character instead of the half-space character. This common incorrect use of the space character leads to some s...
Linear Mixture Models for Robust Machine Translation
As larger and more diverse parallel texts become available, how can we leverage heterogeneous data to train robust machine translation systems that achieve good translation quality on various test domains? This challenge has been addressed so far by repurposing techniques developed for domain adaptation, such as linear mixture models which combine estimates learned on homogeneous subdomains. Ho...
Fine-Tuning for Neural Machine Translation with Limited Degradation across In- and Out-of-Domain Data
Neural machine translation is a recently proposed approach which has shown competitive results to traditional MT approaches. Similar to other neural network based methods, NMT also suffers from low performance for the domains with less available training data. Domain adaptation deals with improving performance of a model trained on large general domain data over test instances from a new domain...